1 Overview & Motivation

Since the COVID-19 outbreak began, research institutes and governments have publicly released many datasets so that research groups and independent individuals can analyze the spread of the coronavirus. We are facing an unprecedented public health crisis, and we believe that data-driven decisions, made by people working together for the greater good, are among the better ways to deal with this difficult time.

We want to know how the world's news media is covering the COVID-19 pandemic. Building on its massive television news narratives dataset, GDELT released a news dataset containing the URLs, titles, publication dates, and brief snippets of more than 1.1 million English-language online news articles from around the world that mention the virus. It is intended to help researchers and journalists understand the global context of how the outbreak has been covered since November 2019. The dataset has been expanding daily and includes a number of related topics.

A single article on COVID-19 can cover various topics, such as health, the business implications of the disease, or climate change, or it could simply be a front for propagating false information. Given the huge number of news articles circulating on the web in the wake of COVID-19, compiling and comparing them is very difficult. To analyze what is being discussed during these difficult times, we would first have to collect all the news articles and then annotate them according to their implicit news sub-categories. This motivates us to create an approach that annotates news articles on the coronavirus without any manual intervention. With such a pipeline we aim not only to give researchers, media professionals, and journalists access to similar articles, but also to spare them the time spent reading and understanding unrelated ones. We thus aim to improve the quality of the groups of similar articles and of the topics that represent them.

We intend to address information overload, the huge flow of information that makes it harder for users to find related information on COVID-19 on the internet. We address it with an application that lets users find news matching their query or interest effortlessly. We foresee some challenges, including determining the subtopic, extracting only the content of each webpage, and presenting the data to the user. In real-world applications, multi-label classification (MLC), in which an object can be identified by more than one label, has a lot of utility. However, labeling the dataset manually is costly and tedious. An unsupervised learning approach should therefore be considered: similar documents are clustered, and topic modeling is then used to assign multiple labels to the clusters. We use an unsupervised learning technique (clustering) to group the collection of articles so that articles in the same group are more similar to each other than to those in other groups. Clustering can also help classify the types of structure discovered.

We analyze this large set of news articles to make it easier for the general public to filter through the many articles related to the virus and reach their own conclusions. Furthermore, we want to understand the semantic relations between different topics and, finally, analyze keywords to uncover patterns in the news content.

3 Research Questions

Can we find articles with topics similar to those of a given article?
In order to answer this question, we need to answer the following research questions:
1. What is the most dominant topic in the article?
2. How do we determine the value of K that is best suited and most interpretable for topic modeling on our dataset?
3. How does the topic model perform with different features, namely bag of words weighted by term frequency-inverse document frequency (TF-IDF) versus bag of words weighted by term frequency (TF) alone?

4 Dataset

5 Exploratory Analysis

5.1 Distribution of Articles

The news articles are distributed over the categories below. These categories are simply the keywords that the GDELT project used to collect the articles. Although we do not intend to use the labels assigned to each article, we have taken a fair distribution of articles from every set in order to avoid biased results.
...

5.2 Pre-processing

Our dataset is in text format, so we pre-processed it before performing any exploratory analysis. This was required to clean the text and remove unnecessary words or characters that would otherwise affect our analysis. Pre-processing is one of the most important steps in natural language processing: well pre-processed data reduces the computation time required for further analysis, and the quality of the tokens and results tends to be higher than with poorly pre-processed data.

Steps taken for Pre-processing

  • Removed URLs from the content

  • Replaced punctuation, numbers and any other non-alphabetic characters

  • Converted Latin characters to UTF-8

  • Converted the text to lower case

  • Removed Stop words
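
The steps above can be sketched in base R as follows. This is a minimal illustration and not the project's actual code: the helper name `preprocess_text`, the tiny stop-word list, and the assumption that the raw text is Latin-1 encoded are all hypothetical. The encoding conversion is done first in this sketch so that the later regular expressions operate on UTF-8 text.

```r
# Hypothetical helper mirroring the listed pre-processing steps (base R only).
preprocess_text <- function(x, stop_words = c("the", "a", "an", "and", "of", "to", "in", "is")) {
  # Assumption: the raw text is Latin-1 encoded; convert it to UTF-8 first.
  x <- iconv(x, from = "latin1", to = "UTF-8")
  # Remove URLs (anything starting with http://, https:// or www.).
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)
  # Replace punctuation, numbers and other non-alphabetic characters with a space.
  x <- gsub("[^A-Za-z]+", " ", x, perl = TRUE)
  # Convert to lower case.
  x <- tolower(x)
  # Drop stop words token by token and rebuild each document.
  tokens <- strsplit(trimws(x), "\\s+")
  vapply(tokens,
         function(tok) paste(tok[!tok %in% stop_words], collapse = " "),
         character(1))
}

preprocess_text("Visit https://example.com NOW: the 3 Cases!")
```

In practice a full stop-word list (for example the one shipped with a text-mining package) would replace the short illustrative vector used here.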

5.3 Wordclouds

Wordclouds give a visual summary of the words underlying a text, in our case the news articles dataset. We are interested in the most prominent words in the corpus, so we generated wordclouds for two variants of the bag-of-words model: one weighted by term frequency and one weighted by term frequency-inverse document frequency.
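
To make the difference between the two weightings concrete, here is a small base R sketch on made-up counts. It uses one common TF-IDF variant, raw count times log(N / document frequency); this is an assumption for illustration, since the exact weighting formula applied in the project is not stated. The function name `tf_idf` and the toy matrix are hypothetical.

```r
# Hypothetical helper: weight a documents x terms count matrix by TF-IDF.
# Assumes every term occurs in at least one document.
tf_idf <- function(tf) {
  doc_freq <- colSums(tf > 0)            # number of documents containing each term
  idf <- log(nrow(tf) / doc_freq)        # rarer terms get a larger weight
  sweep(tf, 2, idf, `*`)                 # multiply each column by its idf
}

# Made-up term counts for three documents.
tf <- matrix(c(3, 0, 1,
               2, 4, 0,
               5, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("doc1", "doc2", "doc3"),
                             c("virus", "market", "vaccine")))
weighted <- tf_idf(tf)
weighted
```

A term such as "virus" that occurs in every document receives an IDF of log(1) = 0, which is why words that dominate a term-frequency wordcloud can shrink or vanish under TF-IDF.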

5.3.0.1 Term Frequency

As the wordcloud below shows, the news articles are dominated by terms about the coronavirus pandemic. Terms with higher weights are drawn in larger fonts. Words in the bag-of-words model stand out more clearly because they are weighted by raw term frequency, whereas words in the TF-IDF model are weighted on the TF-IDF scale and therefore look more uniform in size.

library(wordcloud2)
wordcloud2(df_bow_content, shape = "star", size = 0.4)
...

5.3.0.2 Term frequency - Inverse document frequency

wordcloud2(df_tfidf_content, shape = "star", size = 0.15)
...

To understand the most prominent terms in the article titles, we created a word cloud for the titles of the articles in the corpus.

wordcloud2(df_bow_title, shape = "star", size = 0.15)
...

5.4 Document length & Unique word count

The length of every document is shown as a scatter plot. Most documents have a length in the range 0-12500.

library(ggplot2)
doc_size <- ggplot(test_df, aes(x = ID, y = doc_length)) +
  geom_bar(stat = "identity", aes(fill = ID)) + theme_minimal() + labs(y = "Size", x = "Document")
ggsave("Document_Size_Bar.png", plot = doc_size, height = 5, width = 7)
doc_size_scatter <- ggplot(test_df) + aes(x = X1, y = doc_length) +
  geom_point(size = 1L, colour = "#0c4c8a") + labs(x = "Document", y = "Document Length") + theme_minimal()
...

The number of unique words per document is shown as a density plot.

density <- test_df %>%
  ggplot(aes(x = unique_words)) +
  geom_density(fill = "#009E73", color = "#F0E442", alpha = 0.8) +
  labs(y = "Document", x = "Frequency")
# Limit the x-axis and style its tick labels
density + coord_cartesian(xlim = c(0, 20000)) +
  theme(axis.text.x = element_text(face = "bold", color = "#993333",
                                   size = 12, angle = 45))
...

5.5 Word Frequencies

The words below are the 10 most frequently occurring words.

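library(tm)     # Corpus, DocumentTermMatrix, removeSparseTerms
library(RWeka)  # NGramTokenizer, Weka_control
library(dplyr)  # %>%

docs <- Corpus(VectorSource(test_df$pre_process_content))

# Tokenize into unigrams and weight terms by raw term frequency
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
params <- list(minDocFreq = 1, removeNumbers = TRUE, stopwords = TRUE, stemming = FALSE,
               weighting = weightTf, tokenize = UnigramTokenizer)
dtm <- DocumentTermMatrix(docs, control = params)
dtm <- removeSparseTerms(dtm, 0.99)  # drop terms absent from more than 99% of documents
rowTotals <- apply(dtm, 1, sum)      # total number of words in each document
dtm <- dtm[rowTotals > 0, ]          # drop empty documents

# Sum the counts per term and list the 10 most frequent terms
dtm_uni_freq <- dtm %>%
  as.matrix %>%
  colSums %>%
  sort(decreasing = TRUE)
dtm_uni_freq_d <- data.frame(word = names(dtm_uni_freq), freq = dtm_uni_freq)
head(dtm_uni_freq_d, 10)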
...

5.6 News Progression

The plot below shows the number of articles published from January to April.

p <- test_df2 %>% 
  ggplot(aes(x=Date, y=Count,group = 1)) +
  geom_area(fill="#69b3a2", alpha=0.5) +
  geom_line(color="#69b3a2") +
  ylab("Article Count") +
  theme_ipsum() +
  theme(axis.text.x = element_text(face = "bold", color = "azure4", 
                                   size = 8, angle = 90),panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p <- ggplotly(p)

5.7 Sentiment score & emotions

Let us look at the overall sentiment of the published articles.

ggplot(test_df1, aes(x=ID, y=as.integer(sentiment))) +
  geom_segment( aes(x=ID, xend=ID, y=0, yend=as.integer(sentiment), color=mycolor), size=1.3, alpha=0.9) +
  theme_light() +
  theme(
    legend.position = "none",
    panel.border = element_blank()
  ) +
 labs(y= "Sentiment", x = "Document")
...

To get a better understanding of the most prevalent emotions in the articles, we visualized the strength of each emotion in the corpus.

quickplot(Emotions,data=all_emot, weight=count, geom="bar",fill=Emotions,  ylab="count")+ggtitle("Emotion Analysis")
...

6 Topic Modeling

Topic modeling is an unsupervised method used to infer the abstract topics discussed across a collection of documents. Since the aim of our project is to classify documents, we use topic modeling as a means of labeling our data. Once we have the labeled data, unseen test documents are classified based on their topic probabilities.

6.1 Search for optimal topic number

The first step in performing LDA is to determine the optimal number of topics. We do this with the perplexity measure: since every topic is represented by a probability distribution, we need to measure how well these distributions predict a sample, and perplexity does exactly that (lower is better). Perplexity was computed for LDA models with k ranging from 10 to 30, for both the bag-of-words model and the TF-IDF model. The LDA model with the lowest perplexity is deemed the best model, and its k the optimal number of topics.
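
The report does not show how perplexity was computed, so the following base R function is only a sketch of the standard definition: the exponential of the negative average log-likelihood per word. The function name `lda_perplexity` and the argument names `theta` (document-topic probabilities) and `phi` (topic-term probabilities) are assumptions chosen for this illustration.

```r
# Hypothetical helper: perplexity of a topic model on a document-term matrix.
#   dtm   : documents x terms matrix of counts
#   theta : documents x topics matrix, P(topic | document), rows sum to 1
#   phi   : topics x terms matrix, P(term | topic), rows sum to 1
lda_perplexity <- function(dtm, theta, phi) {
  dtm <- as.matrix(dtm)
  word_probs <- theta %*% phi            # P(term | document)
  observed <- dtm > 0                    # skip zero counts to avoid 0 * log(0)
  log_lik <- sum(dtm[observed] * log(word_probs[observed]))
  exp(-log_lik / sum(dtm))               # lower values mean better predictions
}

# Toy check: if every topic is uniform over 4 terms, the model is as
# uncertain as a fair 4-sided die, so the perplexity equals 4.
dtm   <- matrix(c(2, 0, 1, 1,
                  0, 3, 0, 1), nrow = 2, byrow = TRUE)
theta <- matrix(c(0.7, 0.3,
                  0.2, 0.8), nrow = 2, byrow = TRUE)
phi   <- matrix(0.25, nrow = 2, ncol = 4)
lda_perplexity(dtm, theta, phi)
```

Evaluating such a function for each candidate k and keeping the k with the smallest value corresponds to the selection rule described above.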

...

In our case, the best model turned out to be the bag-of-words model with K = 25 topics.

6.2 Model building

The LDA model was built for term frequency with k = 25 topics. Both Gibbs sampling and the dot-product method were used for prediction.

Model for term frequency with Gibbs sampling

library(textmineR)
# i holds the number of topics being fitted (k = 25 for the final model)
LDA_model_bow <- FitLdaModel(dtm = sparse_matrix_dtm_bow, k = as.integer(i),
                             iterations = 200, burnin = 175)
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow), ],
                  method = "gibbs", iterations = 200, burnin = 175)
p2_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow), ], method = "dot")

Model for term frequency with Dot product sampling

LDA_model_bow <- FitLdaModel(dtm = sparse_matrix_dtm_bow, k = as.integer(i),
                             iterations = 200, burnin = 175)
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow), ],
                  method = "dot", iterations = 200, burnin = 175)
p2_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow), ], method = "dot")

Each document now has a probability for each of the 25 topics. These probabilities act as the labeled data for further prediction.
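
One simple way to turn these per-document topic probabilities into a single label, and so answer the first research question, is to pick the topic with the highest probability. The sketch below is an illustration on made-up numbers; the helper name `dominant_topic` and the topic names `t_1` to `t_3` are hypothetical.

```r
# Hypothetical helper: label each document with its most probable topic.
# theta is a documents x topics matrix of probabilities whose column names are topic names.
dominant_topic <- function(theta) {
  theta <- as.matrix(theta)
  # max.col returns, for every row, the index of the largest entry.
  colnames(theta)[max.col(theta, ties.method = "first")]
}

# Made-up topic probabilities for two documents and three topics.
theta <- matrix(c(0.10, 0.70, 0.20,
                  0.55, 0.05, 0.40),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc_1", "doc_2"), c("t_1", "t_2", "t_3")))
dominant_topic(theta)
```

Because a document can carry substantial weight on several topics, keeping the full probability vector alongside the dominant label preserves the multi-label information.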

6.3 Prediction

Once we have all the documents in the training set labeled, the next step is predicting the topic probabilities for the unseen test set. The predict method of LDA is used to predict the topic probabilities.

Predicting the topics using Term frequency with Gibbs sampling model

p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),], method = "gibbs",iterations = 200, burnin = 175)
The probability distribution of the topics in the train & test set for the Term frequency with Gibbs sampling model can be seen in the plot below.
...

Predicting the topics using Term frequency with Dot product sampling model

p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),], method = "dot",iterations = 200, burnin = 175)
The probability distribution of the topics in the train & test set for the Term frequency with dot product sampling model can be seen in the plot below.
Perplexity Score

6.4 Model evaluation

The next step is to evaluate the model, for which we used the log likelihood: the higher the value, the better the model. The plot below shows the log likelihood for the two models.

Perplexity Score

From the plot it is evident that the bag-of-words model (term frequency) performs better, and hence we use this model from here on.

7 Associations & similarities between topics

Once the prediction is done, we have topic probabilities for all the documents. To find similarities between topics, we cluster the documents based on their topic probabilities.

7.1 Finding optimal number of clusters

7.1.1 Elbow curve

To perform clustering, we need to decide on the optimal number of clusters. We determined this using an elbow curve, which suggests 8 as the optimal number of clusters.

library(Rtsne)

# Reduce the dimensions via t-SNE
tsne <- Rtsne(doc_topics_gamma[, -1], perplexity = 30, pca = FALSE, check_duplicates = FALSE)
X <- data.frame(tsne$Y)

# Find the best number of clusters for 25 topics
wss <- (nrow(X) - 1) * sum(apply(X, 2, var))  # wss[1]: total sum of squares (one cluster)
for (i in 2:100) wss[i] <- sum(kmeans(X, iter.max = 50L, centers = i)$withinss)
plot(1:100, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
Perplexity Score

7.1.2 Silhouette coefficient

Another approach we used to find the optimal number of clusters is the silhouette coefficient. For each point, it compares the mean distance to the other points in its own cluster (intra-cluster distance) with the mean distance to the points in the nearest other cluster (inter-cluster distance). We evaluated this value for 8 and 15 clusters; the results can be seen in the plots below.

Perplexity Score

Perplexity Score

The silhouette coefficient for our clustering was 0.33. Since all documents in our dataset are about the coronavirus, it is no surprise that the silhouette coefficient is low: the distance between the clusters, and thus between the documents within them, is negligible.
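
For reference, the silhouette value of a point is (b - a) / max(a, b), where a is its mean distance to the other points of its own cluster and b is its mean distance to the points of the nearest other cluster. The base R sketch below averages this over all points. It is an illustration on made-up coordinates, not the code used to obtain the value above; the function name `mean_silhouette` is hypothetical, and it assumes Euclidean distances and at least two clusters.

```r
# Hypothetical helper: mean silhouette coefficient for a given clustering.
#   X           : numeric matrix or data frame, one row per point
#   cluster_ids : vector with one cluster label per row of X (at least 2 clusters)
mean_silhouette <- function(X, cluster_ids) {
  d <- as.matrix(dist(X))                # pairwise Euclidean distances
  n <- nrow(d)
  s <- numeric(n)
  for (i in seq_len(n)) {
    same <- cluster_ids == cluster_ids[i]
    same[i] <- FALSE                     # exclude the point itself
    if (!any(same)) next                 # singleton cluster: silhouette is 0 by convention
    a <- mean(d[i, same])                # mean intra-cluster distance
    others <- setdiff(unique(cluster_ids), cluster_ids[i])
    b <- min(vapply(others,              # mean distance to the nearest other cluster
                    function(cl) mean(d[i, cluster_ids == cl]),
                    numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Two well-separated groups of made-up points.
X <- rbind(cbind(c(0, 0.1, 0.2), c(0, 0.1, 0)),
           cbind(c(10, 10.1, 10.2), c(10, 10.1, 10)))
good <- mean_silhouette(X, c(1, 1, 1, 2, 2, 2))   # matches the true groups
bad  <- mean_silhouette(X, c(1, 2, 1, 2, 1, 2))   # mixes the groups
c(good = good, bad = bad)
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.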

7.2 Clustering

Finally, the articles were grouped into 8 clusters.

k3 <- kmeans(X,centers = 8, nstart = 5,iter.max = 100000L)

fviz_cluster(k3,X)
Convex Hull Plot for 8 clusters

7.3 Topics association to clusters

It would be interesting to see how the topics are associated to the clusters. The chord diagram shows the association of each of the topics to the clusters.
Convex Hull Plot for 8 clusters

The line chart represents the behavior of each topic in each of the clusters.
Convex Hull Plot for 8 clusters

8 Visualization of the corpus

The entire document corpus has been visualized in an rbokeh graph. Hovering over a document displays its title, URL, and most dominant topic. Documents belonging to the same topic lie relatively close to each other, although some exceptions exist near the boundaries of each topic.

Similarly, we plotted the articles from the test and training sets in rbokeh in order to view the performance of our model. The plot shows that the topics predicted for documents in the test set lie in the same region as the training-set documents belonging to the same topic.

The pie chart below gives the distribution of the topics dominant in each article of the whole corpus.

9 Topic term Association

This section shows the association between the top terms in the documents and the topics generated from them.

Terms in Topics

10 Topic progression

Finally, the progression of the topics that were mainly discussed during the initial months of the pandemic is displayed below.

11 Conclusion

The news articles published in the past few months discussed different aspects of the coronavirus, and the progression of topics from January to April was evident. Among our models, LDA with bag of words (term frequency) performed better than the other models. Our model was successful in predicting topic distributions for test articles. We were therefore able to find similar articles given an unseen test article.

12 References

13 Team members

Name Email-Id
Calida Pereira calida.pereira@st.ovgu.de
Chandan Radhakrishna chandan.radhakrishna@st.ovgu.de
Nandish Bandi Subbarayappa nandish.bandi@st.ovgu.de
Mohit Jaripatke mohit.jaripatke@st.ovgu.de
Priyanka Bhargava priyanka.bhargava@st.ovgu.de